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Abstract 

We report performance benchmarks for several algorithms that 
we have used to simulate the Schrodinger functional with two fla- 
vors of dynamical quarks. They include hybrid and polynomial 
hybrid Monte Carlo with preconditioning. An appendix describes 
a method to deal with autocorrelations for nonlinear functions of 
primary observables as they are met here due to reweighting. 



1 Introduction 

In the past years the ALPHA Collaboration has pursued the goal to reliably 
compute the QCD gauge coupling at high energy in terms of non-perturbative 
low energy parameters. The concomitant necessity to deal with a large energy 
ratio in the continuum limit was solved by a breakup into recursive steps. 
Here one employs finite size rescaling by repeated factors of two and extrap- 
olates to the continuum each step by itself. By a combination of theoretical 
reasoning and numerical tests the Schrodinger functional was determined as 
a particularly convenient framework for this purpose. The programme has 
been completed for the quenched approximation, see refs. [[J], 0] for reviews 
of the approach and ref. || for a summary of data. First tests with a non 
vanishing flavour number have been reported ||. As is well-known, by the in- 
clusion of dynamical quarks the numerical cost is boosted by a large factor. 
The importance of algorithmic optimization can hence hardly be overesti- 
mated. The finite size technique with the Schrodinger functional - - beside 
its uses for QCD physics -- offers the possibility of an investigation of the 
lattice spacing dependence of the performance of fermion algorithms with 
all physical scales held fixed. Here we report on such results for several 
algorithms. 

The Schrodinger functional can be regarded as the free energy T of QCD 
in a finite volume L 3 x T, 

exp(-r) = J D[U\D®\D[i/>]exp(-S[Urf,il>]). (1.1) 

The action S consists of the usual plaquette action for SU(3) gauge fields 
U and two degenerate flavours of clover-improved Wilson fermions. The 
box is periodic in space, and fixed gluon potentials and vanishing quark 
fields[] are prescribed on the temporal boundaries. The boundary potentials 
C and C at Xo = and Xq = T are specified in terms of the scale L and 
dimensionless parameters, one of which is called rj and is kept variable. A 
convenient practical choice with T = L, introduced as point "A" in ||, is used 
throughout. Quark fields are periodic in space up to a phase 9 = tt/5. An 
Abelian background field is induced which can be varied by changing rj. The 
response to such an infinitesimal variation is used to define the renormalized 

1 Non-vanishing quark sources are also possible but will not be needed here. 



coupling 

where To/^q is the tree level value of T for bare coupling g Q . For a lattice 
realization we now have to choose values for L/a, go and bare quark mass m® 
as well as coefficients for the improvement terms in the action. We take the 
latter as smooth functions of g$ either by a perturbative expression or by a 
non-perturbative fit ||. The mass mo is fixed by demanding zero PCAC-mass 
0. Hence we may approach the continuum limit by a sequence of lattices 
with growing L/a and go adjusted to maintain a fixed value #sf- Conceptually 
this is exactly the same situation as in our quenched computations. The 
regularizing lattice spacing a varies while renormalized physics is held fixed. 
It is on such 'trajectories', that we study algorithm performance. 

Our most extensive simulations of the 0(a) improved Schrodinger func- 
tional have been conducted with the well-known hybrid Monte Carlo method 
(HMC) ||. In our implementation we took advantage of preconditioning and 
the refinement proposed in || . It amounts to the introduction of two different 
step sizes for fermion-gluon and gluonic self-couplings in an approximately 
optimal proportion depending on their relative computational cost. In other 



long runs we applied the polynomial hybrid Monte Carlo (PHMC) fllQ| , II 



Here, as for the multiboson technique [0, an approximately inverting poly 



nomial of the Dirac operator is used to bosonize the theory. In the multiboson 
proposal the resulting action is represented by many boson fields with near- 
est neighbour couplings. For unimproved Wilson fermions finite step-size 
updates are employed, which however become impractical when the clover 
term is included - the case on that we concentrate here. A further disad- 
vantage is the additional slowing down due to collective effects of the many 



bosons |n|. With PHMC the operator polynomial is employed to construct 
a non-local Gaussian action for only one boson field which is simulated by 
HMC. The imperfection of the polynomial can be corrected by an acceptance 
or reweighting step. Some results are reported which have been obtained by 
a recently proposed multi- level Metropolis procedure (MLM) |14]]. Further 
details of the various algorithms will be given below. 



2 Algorithms in this study 

In this section we briefly describe our implementations of fermion Monte 
Carlo algorithms as they are benchmarked in this study. With each of them 
the goal is the inclusion of effects of the weight factor det(Q) 2 which arises 
from integrating two degenerate flavours of quarks out of ( |1 . 1|) , 

exp(-r) = J D[U] exp(-S gaugc [f/]) det(Q) 2 . (2.1) 

Here the hermitian operator Q for Sheikholeslami-Wohlert improved Wilson 
quarks has the structure 

Q = C075M; M = 1 - T - H. (2.2) 

The constant Cq is chosen to contain the eigenvalues of Q in the interior 
of the interval (—1,1). The matrix M contains nearest neighbour hopping 
terms in H and the clover term in T, which is diagonal with respect to the 
lattice index. The detailed form of these components, including boundary 



improvement, can for instance be found in ref. [15] 



2.1 Hybrid Monte Carlo 

The HMC method || has so far been the most popular fermion algorithm 
for QCD. In choosing a trajectory length of unity we followed the general ex- 
perience that this is close to optimal. In [T(J this was confirmed for the 
quenched Schrodinger functional, and a test with dynamical fermions at 
L/a = 8 showed an almost doubling of computational costs for g$ F as we 
lowered the trajectory length to one half. We reduced discretization errors 
by the multiple time scale method proposed in ref. || taking the version 
given there in eq. (6.7) with n = 4. A test of the performance gain of 
the above integration scheme in practical simulations was performed in |T7| , 



where it was demonstrated that a substantial gain is achieved as compared 
to a standard leap-frog integrator. The value of n was not varied any further 
in this study. 

As an essential sophistication we made use of two different forms of pre- 
conditioning. Both rely on our ability to factor out of Q matrix factors which 
on the one hand are easy to invert and on the other hand capture a part of its 
spectral variation to leave us with a better conditioned remaining factor. For 



even-odd preconditioning we exhibit the block structure of M with respect 
to even (e) and odd (o) lattice sites and factorize 

/ M ee M eo \ / M cc \ ( 1 M; e l M eo \ 

" V M oe M 00 J \M oe 1 J V ° M °o - MoeM-^Meo J ' l -° J 

where the left (lower) block-triangular factor and the block-diagonal M ee are 
easy to invert. This factorization can now be used in a two-fold way. If the 
original Q under the determinant in ( j2.1| ) is plugged into the HMC algorithm 
we have to continuously solve linear systems with coefficient matrices given 
by Q. With Q2.3| ) these can be transformed into better conditioned systems 
with accelerated iterative inversion of 

Q = c 7 5 (M oo - M oe M; e l M co ). (2.4) 

The constant Co is again used to normalize the spectrum of Q. On the other 



hand we may also conclude from (|2.3| ) that up to irrelevant constant factors 
the relation 

det(Q) oc det(M ee ) det(Q) (2.5) 

holds. Now Q enters into the HMC and leads to a different Monte Carlo 
dynamics, which also takes det(M ee ) into account. When we refer to even- 
odd preconditioning in this paper, this second variant will always be meant. 
Further details on our implementation of HMC may be found in [|H| . As we 
only have to invert the squared operator Q 2 we use the conjugate gradient 
method (CG), which was found to be close to optimal in this case. 

For SSOR preconditioning a different factorization of M based on factors 
triangular with respect to a lexicographic ordering of lattice sites is used 



19| , |20| , |21| . Due to its complexity, in particular if the clover term is included, 
this has to our knowledge only been used to accelerate linear systems and 
was for that purpose reported to be superior over even-odd preconditioning if 



combined with the BiCGstab |22| inversion algorithm for the preconditioned 



M and M^ . In the following SSOR will refer to such an implementation. For 
the unimproved case, a simplified form of SSOR preconditioning (ILU) was 
implemented under the determinant with very positive results [ [Tcfl . 

We compared the performance of our two HMC program versions on our 
largest lattice, i.e. 12 4 at (3 — 9.5. In solving the linear systems with the 
respective preconditioned operators we confirmed that in terms of operations 
associated with applying these operators to fields, the BiCGstab algorithm 



with SSOR preconditioning outperforms the CG algorithm with even-odd 
preconditioning by a factor of about 1.6. Part of this advantage is however 
lost in terms of CPU time, because on our Alenia Quadrics (APE) machines 
inner products are relatively expensive. Since in the BiCGstab algorithm 
inner products and linear combinations are much more frequent than in the 
CG algorithm, this is a non-negligible overhead. The overall advantage that 
we find for the even-odd version derives however from the different operators 
under the determinant. A clear sign of this is the behaviour of the acceptance 
rate in both cases. While for the even-odd preconditioned determinant we 
could obtain an acceptance rate of 91% with a step size of At = 0.08, for 
det(Q 2 ) it went down to 75% already at a step size of At = 0.07. 

In principle, one could also conceive of the following combination yet 
untested by us. One uses the even-odd preconditioned determinant and, 
when linear systems with Q have to be solved, one transforms them to the 
SSOR preconditioned form, solves, and translates back. It is unclear at 
present, whether the overhead still leaves this variant profitable. 

2.2 Polynomial Hybrid Monte Carlo 

We recall here some basics of the PHMC algorithm. For technical details the 
reader is referred to refs. p3[ ^4fl . In the PHMC algorithm the inverse of 
Q 2 is approximately computed by a suitable [HJ Chebyshev polynomial of 
degree n, 

<r 2 « PnAQ 2 )- (2-6) 

Defining the relative deviation 

R n JX) = XP n JX) - 1, (2.7) 

the inversion error for eigenvalues A G [e, 1] of Q 2 is bounded by 

. r-\ n+l 



5 = sup|i? n , e (A)| =2 1^-^=1 . (2.8) 

For a given degree n the free parameter e in P n ^ allows to trade between 
approximation range and accuracy. For eigenvalues A < e the error mono- 
tonically moves from -R„ )e (e) = —S to i? ne (0) = — 1. With the help of P n>e we 
represent the determinant by a bosonic spinor field (pseudofermion) 

det(Q 2 ) = f D[(f)]D[(P j } exp{-S P )W (2.9) 



with the Gaussian action 

S P = ^P n , e (Q 2 [U])0 (2.10) 

and the remainder 

W = det(Q 2 P„„ e (Q 2 )) (2.11) 

rendering ( [2.9|) exact. As long as the spectrum of Q 2 is in the approximation 
range [e, 1], W is a small correction close to one. Expectation values in the 
full QCD ensemble are now given by reweighting with W as 

where O is some observable and the average (■ ■ -)p is taken with the ac- 
tion S'gaugo + Sp. Since W is still given by a determinant, a straightforward 
evaluation is hard. As it is a small correction, however, stochastic (unbi- 
ased) estimates should be adequate. For each measurement we construct an 
estimator W given by 

W = -J- J>xp U(l ~ [Q'PnAQ 2 )}- 1 ^} (2-13) 



v corr 



with independent Gaussian random fields fy. Averaging over N COTT such es- 
timates allows us to reduce and control the extra noise inflicted here. The 
true QCD average is then estimated by eq. (|2.12|) with W replaced by W. 

The update of the gauge field and the pseudofermionic field follows the 
standard HMC pattern with global heatbath for and molecular dynamics 
for U including the speedup from || discussed before. This is chosen, be- 
cause, in contrast to the multiboson algorithm [0, finite step size updates 



for U are impractical here due to the complicated non-local effective action. 
At this point the parameters n, e and, less importantly, N COTT are at our 
disposal for optimization. For small eigenvalues the growth of the 'inverter' 
-Pn, e (A) ~ 1/A is cut off at A ~ e. For the HMC dynamics e hence, in 
some sense, takes over the role of the smallest eigenvalue. It was found ad- 
vantageous |23, 0] to choose e a few times larger than the typical smallest 
eigenvalue of Q 2 . This allows us to keep the degree of the polynomial lower 
for the same approximation accuracy. Configurations with small eigenval- 
ues of Q 2 are produced more frequently than they would be with the exact 



determinantal weight. As the algorithm is still exact, this is precisely com- 
pensated by W or respectively W giving smaller weight to the observables 
evaluated on these configurations. It should be borne in mind that the un- 
quenched lattice path integral is always well-defined. The potential problem 
with nearly 'exceptional' configurations is a statistical one with rarely sam- 
pled large contributions, which is alleviated by our sampling and reweighting 
technique. This is the reason for us to prefer the reweighting correction over 
an acceptance step. 

There is a special round-off problem for PHM C that we briefly summarize 
now with more details available in [f23], P5j. To generate <p with action ( |2.10| ) 
it is necessary to factorize 

P n ^Q 2 ) = F n {Q)^F n {Q) (2.14) 

with an n'th degree polynomial F n . For gauge field updating ^-derivatives 
of Sp have to be taken at fixed <fi. To this end we factorize further 

F n (Q)<P = [V^(Q - r n )] [^~i(Q - r„_i)] ■ ■ • [V^iiQ - n)}c/> (2.15) 

and store the occurring subproducts to facilitate the force computation. 
While the complex roots r^ are determined by P n ,e, the real factors y/ck only 
serve to prevent the partial products from growing too large or too small. It 
is known [^5| that the evaluation of a high order matrix polynomial in factor- 



ized form is in principle rather susceptible to round-off error. In particular, 
the ordering of factors in ( |2.15| ) is of crucial importance in this context. In 
P3"| , |2"o| orderings were found which make this source of errors negligible for 
the runs with n up to 46 reported in this paper, even on the 32-bit APE 100 
machines. Further details on tuning the polynomial parameters are deferred 
to appendix B. 

Another variant of PHMC could be devised by applying an inverting 
polynomial to the complex spectrum of 75 Q instead of the real positive Q 2 . 



A corresponding multiboson algorithm was investigated in ref. ||26|| . We shall 
investigate this method in the near future. 

2.3 Multi-Level Metropolis Algorithm 

As for the previous algorithms it is our aim to represent det(Q 2 ) oc det (M^M) 
in a way suitable for simulation. Here this will be done in part by the explicit 
use of a few terms of the hopping parameter expansion in powers of T + H 



(see eq. (|2.2|)) and by integrals over a collection of pseudofermion fields to 
represent the remainder. 

The series for the logarithm of the determinant is given by 

logdet(M) = trln(M) = -tr(T + H) - -tr(T + H) 2 - -tr(T + Hf + . . . . 

(2.16) 
In our algorithm, we separate off the series up to some order k. In the 
present work we found it convenient to use k = 3, since tr(T + H) 1 = trT 4 
can easily be computed for i — 1, 2, 3. At higher orders also mixed terms 
would contribute. In order to deal with the remaining terms, we define 

M = M exp ( J2 ~ ( T + H ) j ) ■ (2-17) 

For the inverse of M we introduce a hierarchical approximation by polyno- 
mials Pi of order rii in T + H, 

j 
M~ l = Y[Pp + 0((T + H) n > +1 ). (2.18) 



i=l 



The full required inversion accuracy is reached for the maximal value j = 
I. This is used with r% + r 2 + . . . + r\ pseudofermion fields to derive the 
representation 



l Ti 



det(MtM) oc /nn^[4] D[<t>ia\ exp I -^IP^,,.! 2 ) . (2.19) 

i=l a=l \ i,a / 

The powers r; > 1 are advantageous as they lead to a smaller force on the 



gauge field which allows larger update steps |T4 



Let us summarize the action that is simulated, 

S = S gaugc [U] + S hop [U] + S PF [U, 0] (2.20) 

with| 

S hop [U] = trT + -trT 2 + -trT 3 (2.21) 



trT is small but nonzero, as we also included boundary improvement terms in it. 



and 

S PF [U, ( f)] = J2\^&a\ 2 ■ (2-22) 

i.a 

We now consider S gauge [U] as a zeroth approximation which is taken into 
account in generating finite stepsize primary update proposals. All further 
terms will eventually be implemented by accept-reject filters. In the first 

level action 

n 

S {1) = 5 gaugc + S hop + J2 l p i <M* (2-23) 

0=1 

the hopping term is included as it would otherwise lead to a single link action 
too complex for level zero. All further levels j > 1 are straightforwardly given 
by 

s (3) = s (j-i) + Y^\P 3 4 >ja \ 2 (2.24) 

o=l 
up to j = I. 

We now describe the steps of the multi-level Metropolis update scheme. 
A proposal at level is given by multiple local updates with the Cabibbo- 
Marinari heatbath or the overrelaxation algorithm. One possibility is to 



randomly select v links to be updated. In ref. |1 | we found it advantageous 
to actually update several times a sublattice chosen at random from a set 
that covers the lattice. The size of the sublattices is chosen such that a 
reasonable acceptance at level one of the algorithm is achieved. The whole 
proposal obeys detailed balance with respect to the level zero action as we 
reverse the order of link updates with probability 1/2. The remaining levels 
now proceed recursively as follows. Generate a proposal U by performing tj 
update steps at level j — 1. Accept U as new configuration at level j with 
the probability 

A® = mm\l,exp(-AS U) {U} + AS U) {U])] , (2.25) 

where AS® = S® - S^-V . 

In our implementation, the auxiliary fields <fti a are kept fixed during the 
update cycle of the gauge field after they have been generated by a global 
heatbath by solving 

&,a = P~\ (2.26) 

where rj is a Gaussian random field. 
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In its present form the MLM algorithm will not be the method of choice 
for large lattices. The reason is that its cost will ultimately grow proportional 
to the square of the number of lattice sites. This is because the size of the 
updated blocks cannot grow while maintaining the acceptance rate and thus 
their number is proportional to the volume. Each evaluation of AS^ is 
however also of order volume in complexity. On the other hand, as we shall 
see shortly, MLM can produce precise results at L/a = 5 where we have 
tested it here. As it contains interesting elements, for instance being a finite 
step-size method for improved dynamical fermions, we still found the idea 
and the practical test worth reporting and comparing with other methods 
here, for instance as a basis for further modification. 

3 Benchmarks of algorithmic performance 

3.1 Our measure of efficiency 

We now define two quantities, M cost and D cost , which allow to compare sim- 
ulation costs for a certain physics output between different algorithms and 
parameters. The first measure is machine dependent and refers to actual 
CPU time on the APE100 line of parallel computers currently in use by the 
ALPHA Collaboration. The second one is machine independent with the 
number of Dirac operator applications to a spinor field being our currency. 
As a target quantity, whose statistical accuracy is used for weighing costs in 
either units, we take our coupling g§ F . The precise definitions are 

M:ost = (update time in seconds on machine M) 

x (error of l/gl F f x (4a/T)(4a/L) 3 (3.1) 

and 

-Dcost = (number of applications Q(fi) x (error of l/gl F ) 2 . (3.2) 

Since the squared error in both formulas goes down inversely proportional 
to the run length, both quantities, extracted from given Monte Carlo simu- 
lations, do not depend on their lengths. The reason for focusing on absolute 
errors of l/<7g F * s as follows. We assume for the purpose of error analysis 
only that the running of #sf with L has the structure of 1-loop perturbation 
theory, 1/<?sf ~ — 26 logL + const. Then 

6 -^ = ^(l/gl F )^8 5(l/gl F ) (3.3) 
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holds, and we approximately aim at a certain relative scale uncertainty, in- 
dependently of the size of #sf- In ( |3.1|) the trivial growth proportional to 
the number of lattice sites is cancelled such that both quantities scale in 
the same way. The reference machine M in this publication will always be 
the smallest 8-node machine of type Ql. Most of our data actually come 
from bigger machines with up to 512 nodes. The costs on these machines 
are converted by multiplying naively by the ratio of nodes, e.g. 512/8. This 
means, we neglect communication overheads, which is a small effect on our 
hardware and implementation. Note that with our definitions costs can be 
meaningfully compared also under trivial (replica) parallelization, of which 
we make extensive use. 

While most of the CPU time with dynamical fermions is spent on appli- 
cations of the Dirac operator there is also quite some overhead from other 
operations, in particular on small lattices. This was neglected in D cost except 
for the contribution to the gauge field force in the PHMC algorithm which 
is proportional to the polynomial degree n. As a consequence, the ratio 
D cost /M cost varies between 50% and 80% of the theoretical value referring to 
Q4> operations only. Applications of Q and Q are so close to each other both 
in theoretical complexity and actual CPU time, that we neglect their differ- 
ence. The SSOR preconditioned operator, on the other hand, is counted as 
4/3 Q operations due to extra multiplications of the diagonal (clover) part. 

A typical run that entered our benchmarks is entry 12k in Table 1. With 
M cost ~ 12 and D cost m 3000 we ran a total of 13000 trajectories on an 
L/a = T/a =12 lattice and achieved 6 % scale accuracy in about 6 days on 
256 nodes of APE100. 

3.2 Numerical results 

In Table 1 we list the most important parameters for a subset of our HMC and 
PHMC runs performed to investigate the QCD running gauge coupling with 
two massless flavours^. The fourth column indicates which of the algorithms 
discussed before was used. This table has to be read in conjunction with 
Table 2, where the measured values for M cost and D cost are given. Results 
for -D C ost of MLM follow in Table 4 below. Error estimates for the costs 
stem from the error of t 1iAi of g^p which is determined by the method of 

3 The physical implications of these data will be analyzed in a separate paper J2JJ while 
here we focus on algorithmic aspects. 
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Appendix A. Its value here refers to a unit given by complete update cycles 
(trajectories). The update time in seconds for one such cycle is normalized 
to the APE100-Q1 as discussed with the definition of M cost . In Fig. 1 we plot 
M cost against L/a for all our runs at a fixed scale L in physical units that is 
implicitly determined by the condition g$ F « 1.1. The line corresponding to 
a growth proportional to a -3 is shown as a reference and roughly represents 
the rise of the data. This combines effects of a growing variance of our 
observable in the Schrodinger functional for g| F and of critical slowing down. 
The latter accounts for about two powers of l/a. In Fig. 2 the number 
of conjugate gradient iterations is shown for our even-odd preconditioned 
HMC runs. At least for smaller p| F there is an approximately linear growth 
with L/a which contributes one power to critical slowing down. This is the 
expected behaviour since \/L is the infrared cutoff here analogous to the 
quark mass in other applications. At larger coupling Nqg moderately rises 
in the range that has been explored here. 

The actual cost to determine the running coupling at fixed error for g~ F as 
discussed before hence seems to roughly grow like l/a 7 in the continuum limit, 
at least at the relatively weak coupling considered here. This seems to be 
more optimistic than the quark mass dependence in some previous estimates, 



for instance^ in [p7| . One reason may be that our growth may be slightly 
underestimated due to overhead on the small lattices. Closer inspection 
reveals as another source of difference that in our molecular dynamics steps 
we are not forced to lower the step size At at the rate usually estimated 
for constant acceptance while lowering the quark mass. Keeping L however 
fixed in physical units means that (3 rises when a/L becomes smaller which 
makes the gauge field smoother at the same time. This could lead to smaller 
discretization errors and partly be responsible for the observed behaviour. 
The integration method [[| may in addition interfere with standard scaling 
on intermediate size lattices. 

The vertical dotted line in Fig. 1 is located at L/a = 16 and points 
to M cost ~ 30. This implies about 100 days on 512 nodes for 3% scale 
accuracy. Thus at least with the next generation of APE1000 machines a 
serious continuum calculation should be within reach. For L/a = 12, our 
most expensive lattices so far, we found a slight preference for PHMC This 
conclusion holds also among the larger couplings simulated, with the costs 
being approximately independent of g| F in the present range between 1.1 

4 See also the discussion in [Es|. 
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and 1.8. All runs with L/a = 10 have been performed employing HMC with 
SSORf] and reach up to much larger couplings. Here the growth of M cost and 
-Dcost with gg F becomes clearly visible but not dramatic, see Fig. 3. 

With the MLM algorithm for improved fermions we only have results for 
L/a = 5 whose parameters are collected in Table 3. As updates on level zero 
(gauge action only) we used heatbath sweeps and overrelaxation sweeps over 
certain sets of links. With probability 1/2 we update either set A or set B. 
Set A consists of all spatial links of one randomly selected timeslice together 
with the temporal links at this timeslice in either positive or negative time 
direction. Set B are all links with a randomly selected spatial direction plus 
the temporal boundary links. In the case of set A we perform a heatbath 
sweep over all links followed by five overrelaxation sweeps. In the case of set 
B we perform a heatbath sweep over the temporal boundary links followed by 
five overrelaxation sweeps over the spatial links of one direction. In both cases 
the order of the updating is exactly reversed with probability 1/2 to fulfill 
detailed balance. Given the large set of free parameters, some of them had 
to be chosen rather ad hoc. As we implemented and ran MLM on PCs but 
not on APE100, to which our M cost -values refer, we only quote the machine 
independent D cost here. Together with results for the coupling, which were 
found consistent with HMC results, they are given in Table 4. 

4 Conclusions 

We have studied several simulation algorithms for the Schrodinger functional 
of full QCD with two flavours of massless quarks. Due to relatively high 
statistics on lattices up to 12 4 we obtained precise information on integrated 
autocorrelation times. Although our results are relatively close for the al- 
gorithms compared, there is a slight advantage for the polynomial hybrid 
Monte Carlo for our parameter range and observable. For this numerical 
reason and for the expected advantages from its modified sampling at larger 
coupling, we plan to focus on PHMC for our coming runs with L/a = 16. In 
simulations with ordinary HMC we found even-odd preconditioning of the 
determinant more efficient than SSOR preconditioning the solver alone. 
Acknowledgement We would like to thank Rainer Sommer for numerous 
discussions and for a critical reading of the manuscript. We are grateful to 



5 Our even-odd preconditioned (P)HMC code is unsuitable for this lattice size due to 
machine topology. 
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A Integrated autocorrelation time for func- 
tions of observables 

In this appendix we discuss a method to assess the effect of autocorrela- 
tions on the statistical error of nonlinear functions of simple expectation 
values. We consider a number of observables in a statistical system, and 
by A a , a = 1,2,... we denote their exact mean values. For each observable 
we have a chain of iV unbiased but (auto-) correlated Monte Carlo estimates 
a l a ,i = 1,...,N. Assume that we want to estimate F = f(A a ), where / 
is an in principle arbitrary function. A simple case arising in the context 
of reweighting is the quotient F = A\jA%, while fit parameters extracted 
from a correlation function at a sequence of separations would be a more 
complicated case. 

The obvious estimator for F is given by f(a a ), where 



1 N 



(A.l) 



are the ensemble-means for our simulation. In a correct and equilibrated 
Monte Carlo we expect 

(A a -a a ) = 0, (A.2) 

((A a -a a ) 2 ) = 0(1/JV) (A.3) 

to hold, where the expectation values in this appendix mean the average over 
an infinite number of identical Monte Carlo simulations of length N. Loosely 
speaking, A a and d a differ by 0(l/yN) in an individual Monte Carlo run. 
By Taylor expanding / around the argument A a we find 

(F-f(a a )) = O(l/A0, (A.4) 

a 2 = ((F - f(a a )) 2 ) = 0(1/N). (A.5) 

The first line reveals the (in general unavoidable) bias of our estimator which 
has to be suppressed^ by large enough N. The statistical error a of order 

6 In principle it is also possible to cancel the leading bias-term, for instance by the 
jackknife method. 
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l/\f~N will be discussed in the following. With the gradient vectors 

Hp = f\p(A a ), h p = f lt3 (a a ) (A.6) 

we define projected observables 

A H = Y,A a H a (A.7) 

a 

and analogously with h a and for a a . Now we conclude that up to higher 
orders in 1/iV we just need to know the variance of the projected observable, 

a 2 ^{(A H -a H f)^{{A- h -a- h f). (A.8) 

In practice, the projection can only be performed with h, taken from the 
data, of course. 

From here on one may proceed just like in the case of simple expectation 
values. We may estimate the relevant autocorrelation function at separation 
t for instance^ as 

N-t / , N-t 



i=l \ j=l 

From it the error follows as 




j r (o). 



a 2 = -^2r int (A. 10) 



with 



w 
i 

Tint 



-+> -7-4- A - n 

2 ^r(o) l ; 



The summation window VF is usually chosen large enough that T int saturates 
to a constant within statistical errors. Often this can be achieved by selfcon- 
sistently summing until W/Tint reaches numbers like 5 ... 10. Below the role 
of W will be discussed further. In summary, the deviation of 2r int from one for 
the projected observable describes the complete effect of (auto) correlations 
on the estimation of F. Obviously, a, T and r int all depend on the function 
f(A a ) which has remained implicit in our notation. 



7 Less symmetrically, one might also subtract a^ in each bracket. 
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We would like to conclude this appendix by indicating the advantage of 
explicitly summing T compared to the jackknife binning procedure which is 
often applied for the estimation of errors of secondary quantities like best- 
fit parameters. There one divides N estimates into N/B bins of length B. 
The individual bins are treated as uncorrelated. The fact that this is not 
exactly true leads to a systematic error in the error estimate which is of order 
t/B from the correlation of neighbouring bins. Here r is a general scale of 
autocorrelation times involved. This is usually controlled by demanding a 
plateau of the errors as the bin length is varied. The statistical uncertainty of 
the error estimate is of order y/B/N. These two errors have to be balanced 
at an optimal bin length, which incidentally scales asBoc (iVr 2 ) 1 / 3 . 

If we sum T up to separation W (summation window), systematic errors 
due to the neglected remainder are of order exp(— W/t). The statistical error 
of the error estimate is expected to be of order y/W/N from the number of 



independent windows. In fact, Madras and Sokal quote the formula [30 



£d- = Jm+£ , ( A.i2) 

Tint V iV 

which follows if one approximates the required summed 4-point autocorre- 
lation function by its disconnected part which falls back to the sum over V 
entering into r^t itself. In practice, these errors usually look very reason- 
able under repeated runs. The conclusion is that the systematic errors for 
the "summation method" are much smaller, which, in balancing systematic 
with statistical errors, leads to more accurate error estimates. Taking the 
idea of balancing totally seriously, one would conclude that the "error of 
the error" decays like [1/iV] 1 / 3 with binning and with [h^AQ/iV] 1 / 2 with the 
T-summation method. 



B Tuning of the PHMC algorithm 

Here we summarize our strategy for tuning the free parameters of the PHMC 
algorithm, in particular for the Schrodinger functional at small volume or 
weak coupling. We are interested in the error 0Ar corr of the estimate 

«» = ». (B,) 

(W)p 
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for some observable O (mainly g SF at present) and W given in eq. ( 2.13| ). All 



errors are assessed by the method of the previous appendix. Here the error 
can be decomposed as 

£ + [<4-<£] + [<HL* -<£,]■ ( R2 ) 

Here a 2 ^ corresponds to N corr = oo, or equivalently to the use of the exact 

r 2 



reweighting with W as in eq. ( 2.12 ), while dp refers to the simple mean (O)p. 
The second term in ( |B.2| ) reflects a contribution from "ideal" reweighting 
and hence from the imperfection of the polynomial approximation, while the 
third one is due to our non-ideal stochastic estimation of the correction. Both 
would vanish for a perfect polynomial and are naively proportional to 5 2 , the 
scale of polynomial errors. The third term has a factor l/iV corr in addition. 
We can roughly disentangle them by measuring (O)p and (O) and their 
errors for at least two values of N COTT . The goal now is to take N COTT = 1 ... 4 
and find 5, e such that the reweighting part of the error remains acceptable, 
less than 20%, say. This is to be achieved at the smallest possible value of 
the polynomial degree n. 

In |24]] we found for j3 between 5.4 and 6.8 on 8 3 x 16 the rule 5 ~ 0.01 and 
e « 2A min to be very efficient, where A min is the average smallest eigenvalue of 
Q 2 . At the larger j3- values of the present study we expect the intrinsic fluc- 
tuations caused by the gauge field to be smaller and correspondingly found a 
too large relative contribution from the third term in ( |B.2j ) when the above 
tuning is employed. Instead we found it much more efficient to attenuate 
this term with a smaller 5. This turned out to be possible essentially with- 
out enlarging n, i.e. we could allow e to grow even larger than 2A min . This 
is probably due to smaller fluctuations of the small eigenvalues as well. In 
Table 5 and 6 we report simulation parameters of the PHMC runs on 12 4 . 
As to the choice of Co it is noted that one has to ensure A max < 1 for all 
configurations for a numerically stable evaluation of ( |2.15| ). On the other 
hand, the efficiency is not very sensitive to the precise value of (A max ) which 
we hence kept safely below one. 
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set L/a (3 algorithm At P. c 



ace 



12a 12 9.5 HMC (e/o) 0.080 0.91 

12b 12 9.5 HMC (SSOR) 0.060 0.86 

12c 12 9.5 HMC (SSOR) 0.070 0.75 

12d 12 9.5 PHMC (e/o) 0.091 0.83 

12e 12 9.5 PHMC (e/o) 0.100 0.76 

12f 12 8.5 HMC (e/o) 0.070 0.93 

12g 12 8.5 PHMC (e/o) 0.091 0.80 

12h 12 8.5 PHMC (e/o) 0.114 0.75 

12i 12 7.5 HMC (e/o) 0.075 0.89 

12j 12 7.5 PHMC (e/o) 0.098 0.77 

12k 12 7.5 PHMC (e/o) 0.098 0.78 



10a 


10 


9.3884 


HMC (SSOR) 


0.080 


0.78 


10b 


10 


8.39 


HMC (SSOR) 


0.080 


0.75 


10c 


10 


7.3619 


HMC (SSOR) 


0.070 


0.79 


lOd 


10 


6.877 


HMC (SSOR) 


0.070 


0.72 


lOe 


10 


6.5 


HMC (SSOR) 


0.060 


0.80 


lOf 


10 


6.0 


HMC (SSOR) 


0.050 


0.83 


lOg 


10 


5.5 


HMC (SSOR) 


0.040 


0.82 


8a 


8 


9.2364 


HMC (e/o) 


0.080 


0.96 


8b 


8 


8.2373 


HMC (e/o) 


0.120 


0.91 


8c 


8 


7.2103 


HMC (SSOR) 


0.100 


0.71 



6a 6 9.5 HMC (e/o) 0.110 0.97 

6b 6 9.0 HMC (e/o) 0.110 0.97 

6c 6 8.5 HMC (e/o) 0.100 0.97 

6d 6 7.5 HMC (e/o) 0.070 0.98 

5a 5 9.3884 HMC (SSOR) 0.120 0.92 

5b 5 7.3619 HMC (SSOR) 0.120 0.90 



4a 


4 


9.2364 


HMC (e/o) 


0.130 


0.98 


4b 


4 


8.24 


HMC (e/o) 


0.120 


0.98 


4c 


4 


7.21 


HMC (e/o) 


0.200 


0.93 



Table 1: Summary of the simulated parameter sets, which enter our perfor- 
mance studies of dynamical fermion algorithms. Quark masses are close to 
their critical values. 
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set 



9sf 



Tint 



t 



update 



/[*: 



M, 



cost 



^cost/10 2 



12a 


1.1 


2.00(16) 


1582 


15(1) 


36(3) 


12b 


1.1 


1.98(23) 


2392 


20(2) 


30(4) 


12c 


1.1 


2.82(27) 


1773 


22(2) 


37(4) 


12d 


1.1 


1.41(10) 


1636 


12.9(9) 


30(2) 


12e 


1.1 


1.35(7) 


1335 


12.0(6) 


28(2) 


12f 


1.3 


2.19(12) 


1768 


15.0(8) 


38(2) 


12g 


1.3 


1.16(7) 


1518 


18.3(1.1) 


44(3) 


12h 


1.3 


2.47(15) 


1103 


11.6(7) 


27(2) 


12i 


1.7 


2.35(14) 


1895 


14.7(9) 


39(2) 


12j 


1.7 


2.05(14) 


1554 


12.6(9) 


32(2) 


12k 


1.7 


1.81(13) 


1371 


11.6(9) 


28(2) 


10a 


1.1 


1.95(7) 


707 


9.7(4) 


16.2(6) 


10b 


1.3 


2.35(10) 


719 


10.7(5) 


17.5(8) 


10c 


1.7 


2.63(13) 


804 


11.5(6) 


18.7(9) 


lOd 


2.1 


3.42(20) 


806 


14.4(8) 


25(1) 


lOe 


2.4 


3.46(21) 


991 


16(1) 


28(2) 


lOf 


3.3 


3.42(21) 


1274 


19(1) 


33(2) 


lOg 


5.4 


3.94(22) 


1932 


30(2) 


49(3) 


8a 


1.1 


1.14(4) 


260 


4.0(1) 


8.2(3) 


8b 


1.3 


1.40(6) 


183 


3.0(1) 


6.2(3) 


8c 


1.7 


2.40(9) 


239 


5.8(2) 


8.4(3) 


6 a 


1.0 


0.70(2) 


49 


1.00(3) 


1.98(6) 


6b 


1.1 


0.75(2) 


49 


1.04(3) 


2.09(6) 


6 c 


1.2 


0.80(2) 


55 


1.15(3) 


2.34(6) 


6d 


1.5 


1.08(2) 


74 


2.08(4) 


4.50(8) 


5 a 


1.0 


0.67(1) 


22 


0.70(1) 


1.03(2) 


5b 


1.5 


1.02(3) 


22 


0.83(2) 


1.26(4) 


4a 


1.0 


0.53(1) 


6 


0.268(5) 


0.428(8) 


4b 


1.2 


0.57(1) 


7 


0.271(5) 


0.445(8) 


4c 


1.5 


0.74(1) 


5 


0.222(3) 


0.397(5) 



Table 2: Cost estimates for the simulations in the parameter sets specified 
in Table |1[ Precise results and analysis on the running of g§ F will appear in 



[29 . 
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M 
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20 
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Figure 1: Measured values of M cost for runs with constant physics fixed by 
g| F ~ 1.1 and vanishing quark mass. 
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Figure 2: Average number of iterations of the conjugate gradient algorithm 
in even-odd preconditioned HMC runs. 



(3 T\ m n 2 n 3 n 4 t\ t 2 t 3 t 4 



a i 



a 2 



a 3 



a,\ 



8.39 4 15 31 63 255 1 6 6 10 0.313 0.828 0.966 0.998 
8.854 4 15 31 63 255 1 6 6 10 0.347 0.835 0.969 0.999 

9.40 3 11 23 63 255 1 8 10 10 0.355 0.742 0.923 0.996 



Table 3: Parameters of the MLM update cycle, where / = 4, r 2 = r 3 = r 4 = 1 
and Oj denotes the acceptance rate at the level j. 
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Figure 3: Measured values of M cost for runs with HMC-SSOR on L/a 
lattices versus renormalized coupling. 



10 



13 8.39 



8.854 



9.40 



g 2 SF 1.1807(12) 
D cost 290(6) 



1.0778(10) 0.9767(16) 
287(6) 372(15) 



Table 4: -D cos t for simulations with the multi-level algorithm of 5 4 lattices 
at various /3-values. 
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set 


/9 


i "corr 


e 


n 


8 


12d 


9.5 


4 


0.0050 


44 


0.0034 


12e 


9.5 


2 


0.0050 


44 


0.0034 


12g 


8.5 


4 


0.0044 


44 


0.0050 


12h 


8.5 


2 


0.0069 


42 


0.0016 


12j 


7.5 


4 


0.0050 


46 


0.0026 


12k 


7.5 


2 


0.0050 


46 


0.0026 



Table 5: Parameters controlling the polynomial approximation to Q 2 for 
PHMC runs on 12 4 . 



set 



X^min/ 



\^max/ 



Co 



a 



M 



\ u P 



12d 


0.002477(15) 


0.8582(1) 


0.6738653 


1.05 


12e 


0.002474(12) 


0.8582(1) 


0.6738653 


1.10 


12g 


0.002284(14) 


0.8765(1) 


0.6686256 


1.35 


12h 


0.002270(13) 


0.8765(1) 


0.6686256 


1.14 


12j 


0.001869(14) 


0.8664(1) 


0.6477127 


1.06 


12k 


0.001860(15) 


0.8667(1) 


0.6477127 


1.03 



Table 6: Average smallest and largest eigenvalue of Q 2 (see eq. ( p.4[ ) for c ) 
and ratio of errors in eq. ( [B . 2| ) . 



